{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Anomaly Detection\n",
    "\n",
    "In this exercise, you will implement the anomaly detection algorithm and apply it to detect failing servers on a network. \n",
    "\n",
    "\n",
    "\n",
    "# Outline\n",
    "- [ 1 - Packages ](#1)\n",
    "- [ 2 - Anomaly detection](#2)\n",
    "  - [ 2.1 Problem Statement](#2.1)\n",
    "  - [ 2.2  Dataset](#2.2)\n",
    "  - [ 2.3 Gaussian distribution](#2.3)\n",
    "    - [ Exercise 1](#ex01)\n",
    "    - [ Exercise 2](#ex02)\n",
    "  - [ 2.4 High dimensional dataset](#2.4)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"1\"></a>\n",
    "## 1 - Packages \n",
    "\n",
    "First, let's run the cell below to import all the packages that you will need during this assignment.\n",
    "- [numpy](www.numpy.org) is the fundamental package for working with matrices in Python.\n",
    "- [matplotlib](http://matplotlib.org) is a famous library to plot graphs in Python.\n",
    "- ``utils.py`` contains helper functions for this assignment. You do not need to modify code in this file.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from utils import *\n",
    "\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"2\"></a>\n",
    "## 2 - Anomaly detection\n",
    "\n",
    "<a name=\"2.1\"></a>\n",
    "### 2.1 Problem Statement\n",
    "\n",
    "In this exercise, you will implement an anomaly detection algorithm to\n",
    "detect anomalous behavior in server computers.\n",
    "\n",
    "The dataset contains two features - \n",
    "   * throughput (mb/s) and \n",
    "   * latency (ms) of response of each server.\n",
    "\n",
    "While your servers were operating, you collected $m=307$ examples of how they were behaving, and thus have an unlabeled dataset $\\{x^{(1)}, \\ldots, x^{(m)}\\}$. \n",
    "* You suspect that the vast majority of these examples are “normal” (non-anomalous) examples of the servers operating normally, but there might also be some examples of servers acting anomalously within this dataset.\n",
    "\n",
    "You will use a Gaussian model to detect anomalous examples in your\n",
    "dataset. \n",
    "* You will first start on a 2D dataset that will allow you to visualize what the algorithm is doing.\n",
    "* On that dataset you will fit a Gaussian distribution and then find values that have very low probability and hence can be considered anomalies. \n",
    "* After that, you will apply the anomaly detection algorithm to a larger dataset with many dimensions. \n",
    "\n",
    "<a name=\"2.2\"></a>\n",
    "### 2.2  Dataset\n",
    "\n",
    "You will start by loading the dataset for this task. \n",
    "- The `load_data()` function shown below loads the data into the variables `X_train`, `X_val` and `y_val` \n",
    "    - You will use `X_train` to fit a Gaussian distribution \n",
    "    - You will use `X_val` and `y_val` as a cross validation set to select a threshold and determine anomalous vs normal examples"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the dataset\n",
    "X_train, X_val, y_val = load_data()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### View the variables\n",
    "Let's get more familiar with your dataset.  \n",
    "- A good place to start is to just print out each variable and see what it contains.\n",
    "\n",
    "The code below prints the first five elements of each of the variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Display the first five elements of X_train\n",
    "print(\"The first 5 elements of X_train are:\\n\", X_train[:5])  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Display the first five elements of X_val\n",
    "print(\"The first 5 elements of X_val are\\n\", X_val[:5])  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Display the first five elements of y_val\n",
    "print(\"The first 5 elements of y_val are\\n\", y_val[:5])  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Check the dimensions of your variables\n",
    "\n",
    "Another useful way to get familiar with your data is to view its dimensions.\n",
    "\n",
    "The code below prints the shape of `X_train`, `X_val` and `y_val`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print ('The shape of X_train is:', X_train.shape)\n",
    "print ('The shape of X_val is:', X_val.shape)\n",
    "print ('The shape of y_val is: ', y_val.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Visualize your data\n",
    "\n",
    "Before starting on any task, it is often useful to understand the data by visualizing it. \n",
    "- For this dataset, you can use a scatter plot to visualize the data (`X_train`), since it has only two properties to plot (throughput and latency)\n",
    "\n",
    "- Your plot should look similar to the one below\n",
    "<img src=\"images/figure1.png\" width=\"500\" height=\"500\">"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a scatter plot of the data. To change the markers to blue \"x\",\n",
    "# we used the 'marker' and 'c' parameters\n",
    "plt.scatter(X_train[:, 0], X_train[:, 1], marker='x', c='b') \n",
    "\n",
    "# Set the title\n",
    "plt.title(\"The first dataset\")\n",
    "# Set the y-axis label\n",
    "plt.ylabel('Throughput (mb/s)')\n",
    "# Set the x-axis label\n",
    "plt.xlabel('Latency (ms)')\n",
    "# Set axis range\n",
    "plt.axis([0, 30, 0, 30])\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"2.3\"></a>\n",
    "### 2.3 Gaussian distribution\n",
    "\n",
    "To perform anomaly detection, you will first need to fit a model to the data’s distribution.\n",
    "\n",
    "* Given a training set $\\{x^{(1)}, ..., x^{(m)}\\}$ you want to estimate the Gaussian distribution for each\n",
    "of the features $x_i$. \n",
    "\n",
    "* Recall that the Gaussian distribution is given by\n",
    "\n",
    "   $$ p(x ; \\mu,\\sigma ^2) = \\frac{1}{\\sqrt{2 \\pi \\sigma ^2}}\\exp^{ - \\frac{(x - \\mu)^2}{2 \\sigma ^2} }$$\n",
    "\n",
    "   where $\\mu$ is the mean and $\\sigma^2$ controls the variance.\n",
    "   \n",
    "* For each feature $i = 1\\ldots n$, you need to find parameters $\\mu_i$ and $\\sigma_i^2$ that fit the data in the $i$-th dimension $\\{x_i^{(1)}, ..., x_i^{(m)}\\}$ (the $i$-th dimension of each example).\n",
    "\n",
    "### 2.2.1 Estimating parameters for a Gaussian\n",
    "\n",
    "**Implementation**: \n",
    "\n",
    "Your task is to complete the code in `estimate_gaussian` below."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"ex01\"></a>\n",
    "### Exercise 1\n",
    "\n",
    "Please complete the `estimate_gaussian` function below to calculate `mu` (mean for each feature in `X`)and `var` (variance for each feature in `X`). \n",
    "\n",
    "You can estimate the parameters, ($\\mu_i$, $\\sigma_i^2$), of the $i$-th\n",
    "feature by using the following equations. To estimate the mean, you will\n",
    "use:\n",
    "\n",
    "$$\\mu_i = \\frac{1}{m} \\sum_{j=1}^m x_i^{(j)}$$\n",
    "\n",
    "and for the variance you will use:\n",
    "$$\\sigma_i^2 = \\frac{1}{m} \\sum_{j=1}^m (x_i^{(j)} - \\mu_i)^2$$\n",
    "\n",
    "If you get stuck, you can check out the hints presented after the cell below to help you with the implementation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# UNQ_C1\n",
    "# GRADED FUNCTION: estimate_gaussian\n",
    "\n",
    "def estimate_gaussian(X): \n",
    "    \"\"\"\n",
    "    Calculates mean and variance of all features \n",
    "    in the dataset\n",
    "    \n",
    "    Args:\n",
    "        X (ndarray): (m, n) Data matrix\n",
    "    \n",
    "    Returns:\n",
    "        mu (ndarray): (n,) Mean of all features\n",
    "        var (ndarray): (n,) Variance of all features\n",
    "    \"\"\"\n",
    "\n",
    "    m, n = X.shape\n",
    "    \n",
    "    ### START CODE HERE ### \n",
    "    \n",
    "    ### END CODE HERE ### \n",
    "        \n",
    "    return mu, var"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<details>\n",
    "  <summary><font size=\"3\" color=\"darkgreen\"><b>Click for hints</b></font></summary>\n",
    "  \n",
    "   * You can implement this function in two ways: \n",
    "      * 1 - by having two nested for loops - one looping over the **columns** of `X` (each feature) and then looping over each data point. \n",
    "      * 2 - in a vectorized manner by using `np.sum()` with `axis = 0` parameter (since we want the sum for each column)\n",
    "\n",
    "    \n",
    "   * Here's how you can structure the overall implementation of this function for the vectorized implementation:\n",
    "     ```python  \n",
    "    def estimate_gaussian(X): \n",
    "        m, n = X.shape\n",
    "    \n",
    "        ### START CODE HERE ### \n",
    "        mu = # Your code here to calculate the mean of every feature\n",
    "        var = # Your code here to calculate the variance of every feature \n",
    "        ### END CODE HERE ### \n",
    "        \n",
    "        return mu, var\n",
    "    ```\n",
    "\n",
    "    If you're still stuck, you can check the hints presented below to figure out how to calculate `mu` and `var`.\n",
    "    \n",
    "    <details>\n",
    "          <summary><font size=\"2\" color=\"darkblue\"><b>Hint to calculate mu</b></font></summary>\n",
    "           &emsp; &emsp; You can use <a href=\"https://numpy.org/doc/stable/reference/generated/numpy.sum.html\">np.sum</a> to with `axis = 0` parameter to get the sum for each column of an array\n",
    "          <details>\n",
    "              <summary><font size=\"2\" color=\"blue\"><b>&emsp; &emsp; More hints to calculate mu</b></font></summary>\n",
    "               &emsp; &emsp; You can compute mu as <code>mu = 1 / m * np.sum(X, axis = 0)</code>\n",
    "           </details>\n",
    "    </details>\n",
    "    \n",
    "    <details>\n",
    "          <summary><font size=\"2\" color=\"darkblue\"><b>Hint to calculate var</b></font></summary>\n",
    "           &emsp; &emsp; You can use <a href=\"https://numpy.org/doc/stable/reference/generated/numpy.sum.html\">np.sum</a> to with `axis = 0` parameter to get the sum for each column of an array and <code>**2</code> to get the square.\n",
    "          <details>\n",
    "              <summary><font size=\"2\" color=\"blue\"><b>&emsp; &emsp; More hints to calculate var</b></font></summary>\n",
    "               &emsp; &emsp; You can compute var as <code> var = 1 / m * np.sum((X - mu) ** 2, axis = 0)</code>\n",
    "           </details>\n",
    "    </details>\n",
    "    \n",
    "</details>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can check if your implementation is correct by running the following test code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Estimate mean and variance of each feature\n",
    "mu, var = estimate_gaussian(X_train)              \n",
    "\n",
    "print(\"Mean of each feature:\", mu)\n",
    "print(\"Variance of each feature:\", var)\n",
    "    \n",
    "# UNIT TEST\n",
    "from public_tests import *\n",
    "estimate_gaussian_test(estimate_gaussian)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Expected Output**:\n",
    "<table>\n",
    "  <tr>\n",
    "    <td> <b>Mean of each feature: <b>  </td> \n",
    "    <td> [14.11222578 14.99771051]</td> \n",
    "   </tr>    \n",
    "   <tr>\n",
    "    <td> <b>Variance of each feature: <b>  </td>\n",
    "     <td> [1.83263141 1.70974533] </td> \n",
    "  </tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that you have completed the code in `estimate_gaussian`, we will visualize the contours of the fitted Gaussian distribution. \n",
    "\n",
    "You should get a plot similar to the figure below. \n",
    "<img src=\"images/figure2.png\" width=\"500\" height=\"500\">\n",
    "\n",
    "\n",
    "From your plot you can see that most of the examples are in the region with the highest probability, while the anomalous examples are in the regions with lower probabilities."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Returns the density of the multivariate normal\n",
    "# at each data point (row) of X_train\n",
    "p = multivariate_gaussian(X_train, mu, var)\n",
    "\n",
    "#Plotting code \n",
    "visualize_fit(X_train, mu, var)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2.2 Selecting the threshold $\\epsilon$\n",
    "\n",
    "Now that you have estimated the Gaussian parameters, you can investigate which examples have a very high probability given this distribution and which examples have a very low probability.  \n",
    "\n",
    "* The low probability examples are more likely to be the anomalies in our dataset. \n",
    "* One way to determine which examples are anomalies is to select a threshold based on a cross validation set. \n",
    "\n",
    "In this section, you will complete the code in `select_threshold` to select the threshold $\\varepsilon$ using the $F_1$ score on a cross validation set.\n",
    "\n",
    "* For this, we will use a cross validation set\n",
    "$\\{(x_{\\rm cv}^{(1)}, y_{\\rm cv}^{(1)}),\\ldots, (x_{\\rm cv}^{(m_{\\rm cv})}, y_{\\rm cv}^{(m_{\\rm cv})})\\}$, where the label $y=1$ corresponds to an anomalous example, and $y=0$ corresponds to a normal example. \n",
    "* For each cross validation example, we will compute $p(x_{\\rm cv}^{(i)})$. The vector of all of these probabilities $p(x_{\\rm cv}^{(1)}), \\ldots, p(x_{\\rm cv}^{(m_{\\rm cv)}})$ is passed to `select_threshold` in the vector `p_val`. \n",
    "* The corresponding labels $y_{\\rm cv}^{(1)}, \\ldots, y_{\\rm cv}^{(m_{\\rm cv)}}$ is passed to the same function in the vector `y_val`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"ex02\"></a>\n",
    "### Exercise 2\n",
    "Please complete the `select_threshold` function below to find the best threshold to use for selecting outliers based on the results from a validation set (`p_val`) and the ground truth (`y_val`). \n",
    "\n",
    "* In the provided code `select_threshold`, there is already a loop that will try many different values of $\\varepsilon$ and select the best $\\varepsilon$ based on the $F_1$ score. \n",
    "\n",
    "* You need implement code to calculate the F1 score from choosing `epsilon` as the threshold and place the value in `F1`. \n",
    "\n",
    "  * Recall that if an example $x$ has a low probability $p(x) < \\varepsilon$, then it is classified as an anomaly. \n",
    "        \n",
    "  * Then, you can compute precision and recall by: \n",
    "   $$\\begin{aligned}\n",
    "   prec&=&\\frac{tp}{tp+fp}\\\\\n",
    "   rec&=&\\frac{tp}{tp+fn},\n",
    "   \\end{aligned}$$ where\n",
    "    * $tp$ is the number of true positives: the ground truth label says it’s an anomaly and our algorithm correctly classified it as an anomaly.\n",
    "    * $fp$ is the number of false positives: the ground truth label says it’s not an anomaly, but our algorithm incorrectly classified it as an anomaly.\n",
    "    * $fn$ is the number of false negatives: the ground truth label says it’s an anomaly, but our algorithm incorrectly classified it as not being anomalous.\n",
    "\n",
    "  * The $F_1$ score is computed using precision ($prec$) and recall ($rec$) as follows:\n",
    "    $$F_1 = \\frac{2\\cdot prec \\cdot rec}{prec + rec}$$ \n",
    "\n",
    "**Implementation Note:** \n",
    "In order to compute $tp$, $fp$ and $fn$, you may be able to use a vectorized implementation rather than loop over all the examples.\n",
    "\n",
    "\n",
    "If you get stuck, you can check out the hints presented after the cell below to help you with the implementation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# UNQ_C2\n",
    "# GRADED FUNCTION: select_threshold\n",
    "\n",
    "def select_threshold(y_val, p_val): \n",
    "    \"\"\"\n",
    "    Finds the best threshold to use for selecting outliers \n",
    "    based on the results from a validation set (p_val) \n",
    "    and the ground truth (y_val)\n",
    "    \n",
    "    Args:\n",
    "        y_val (ndarray): Ground truth on validation set\n",
    "        p_val (ndarray): Results on validation set\n",
    "        \n",
    "    Returns:\n",
    "        epsilon (float): Threshold chosen \n",
    "        F1 (float):      F1 score by choosing epsilon as threshold\n",
    "    \"\"\" \n",
    "\n",
    "    best_epsilon = 0\n",
    "    best_F1 = 0\n",
    "    F1 = 0\n",
    "    \n",
    "    step_size = (max(p_val) - min(p_val)) / 1000\n",
    "    \n",
    "    for epsilon in np.arange(min(p_val), max(p_val), step_size):\n",
    "    \n",
    "        ### START CODE HERE ### \n",
    "        \n",
    "        ### END CODE HERE ### \n",
    "        \n",
    "        if F1 > best_F1:\n",
    "            best_F1 = F1\n",
    "            best_epsilon = epsilon\n",
    "        \n",
    "    return best_epsilon, best_F1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<details>\n",
    "  <summary><font size=\"3\" color=\"darkgreen\"><b>Click for hints</b></font></summary>\n",
    "\n",
    "   * Here's how you can structure the overall implementation of this function for the vectorized implementation:\n",
    "     ```python  \n",
    "    def select_threshold(y_val, p_val): \n",
    "        best_epsilon = 0\n",
    "        best_F1 = 0\n",
    "        F1 = 0\n",
    "    \n",
    "        step_size = (max(p_val) - min(p_val)) / 1000\n",
    "    \n",
    "        for epsilon in np.arange(min(p_val), max(p_val), step_size):\n",
    "    \n",
    "            ### START CODE HERE ### \n",
    "            predictions = # Your code here to calculate predictions for each example using epsilon as threshold\n",
    "        \n",
    "            tp = # Your code here to calculate number of true positives\n",
    "            fp = # Your code here to calculate number of false positives\n",
    "            fn = # Your code here to calculate number of false negatives\n",
    "        \n",
    "            prec = # Your code here to calculate precision\n",
    "            rec = # Your code here to calculate recall\n",
    "        \n",
    "            F1 = # Your code here to calculate F1\n",
    "            ### END CODE HERE ### \n",
    "        \n",
    "            if F1 > best_F1:\n",
    "                best_F1 = F1\n",
    "                best_epsilon = epsilon\n",
    "        \n",
    "        return best_epsilon, best_F1\n",
    "    ```\n",
    "\n",
    "    If you're still stuck, you can check the hints presented below to figure out how to calculate each variable.\n",
    "    \n",
    "    <details>\n",
    "          <summary><font size=\"2\" color=\"darkblue\"><b>Hint to calculate predictions</b></font></summary>\n",
    "           &emsp; &emsp; If an example  𝑥  has a low probability  $p(x) < \\epsilon$ , then it is classified as an anomaly. To get predictions for each example (0/ False for normal and 1/True for anomaly), you can use <code>predictions = (p_val < epsilon)</code>\n",
    "    </details>\n",
    "    \n",
    "    <details>\n",
    "          <summary><font size=\"2\" color=\"darkblue\"><b>Hint to calculate tp, fp, fn</b></font></summary>\n",
    "           &emsp; &emsp; \n",
    "        <ul>\n",
    "          <li>If you have several binary values in an $n$-dimensional\n",
    "binary vector, you can find out how many values in this vector are 0 by using:  `np.sum(v == 0)`</li>\n",
    "          <li>You can also apply a logical *and* operator to such binary vectors. For instance,  `predictions` is a binary vector of the size of your number of cross validation set, where the $i$-th element is 1 if your algorithm considers $x_{\\rm cv}^{(i)}$ an anomaly, and 0 otherwise. </li>\n",
    "          <li>You can then, for example, compute the number of false positives using:  \n",
    "<code>fp = sum((predictions == 1) & (y_val == 0))</code>.</li>\n",
    "        </ul>\n",
    "         <details>\n",
    "              <summary><font size=\"2\" color=\"blue\"><b>&emsp; &emsp; More hints to calculate tp, fn</b></font></summary>\n",
    "               &emsp; &emsp;\n",
    "             <ul>\n",
    "              <li>You can compute tp as <code> tp = np.sum((predictions == 1) & (y_val == 1))</code></li>\n",
    "              <li>You can compute tn as <code> fn = np.sum((predictions == 0) & (y_val == 1))</code></li>  \n",
    "              </ul>\n",
    "          </details>\n",
    "    </details>\n",
    "        \n",
    "    <details>\n",
    "          <summary><font size=\"2\" color=\"darkblue\"><b>Hint to calculate precision</b></font></summary>\n",
    "           &emsp; &emsp; You can calculate precision as <code>prec = tp / (tp + fp)</code>\n",
    "    </details>\n",
    "        \n",
    "    <details>\n",
    "          <summary><font size=\"2\" color=\"darkblue\"><b>Hint to calculate recall</b></font></summary>\n",
    "           &emsp; &emsp; You can calculate recall as <code>rec = tp / (tp + fn)</code>\n",
    "    </details>\n",
    "        \n",
    "    <details>\n",
    "          <summary><font size=\"2\" color=\"darkblue\"><b>Hint to calculate F1</b></font></summary>\n",
    "           &emsp; &emsp; You can calculate F1 as <code>F1 = 2 * prec * rec / (prec + rec)</code>\n",
    "    </details>\n",
    "    \n",
    "</details>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can check your implementation using the code below"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "p_val = multivariate_gaussian(X_val, mu, var)\n",
    "epsilon, F1 = select_threshold(y_val, p_val)\n",
    "\n",
    "print('Best epsilon found using cross-validation: %e' % epsilon)\n",
    "print('Best F1 on Cross Validation Set: %f' % F1)\n",
    "    \n",
    "# UNIT TEST\n",
    "select_threshold_test(select_threshold)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Expected Output**:\n",
    "<table>\n",
    "  <tr>\n",
    "    <td> <b>Best epsilon found using cross-validation: <b>  </td> \n",
    "    <td> 8.99e-05</td> \n",
    "   </tr>    \n",
    "   <tr>\n",
    "    <td> <b>Best F1 on Cross Validation Set: <b>  </td>\n",
    "     <td> 0.875 </td> \n",
    "  </tr>\n",
    "</table>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we will run your anomaly detection code and circle the anomalies in the plot (Figure 3 below).\n",
    "\n",
    "<img src=\"images/figure3.png\" width=\"500\" height=\"500\">"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Find the outliers in the training set \n",
    "outliers = p < epsilon\n",
    "\n",
    "# Visualize the fit\n",
    "visualize_fit(X_train, mu, var)\n",
    "\n",
    "# Draw a red circle around those outliers\n",
    "plt.plot(X_train[outliers, 0], X_train[outliers, 1], 'ro',\n",
    "         markersize= 10,markerfacecolor='none', markeredgewidth=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a name=\"2.4\"></a>\n",
    "### 2.4 High dimensional dataset\n",
    "\n",
    "Now,  we will run the anomaly detection algorithm that you implemented on a more realistic and much harder dataset.\n",
    "\n",
    "In this dataset, each example is described by 11 features, capturing many more properties of your compute servers.\n",
    "\n",
    "Let's start by loading the dataset.\n",
    "\n",
    "- The `load_data()` function shown below loads the data into variables `X_train_high`, `X_val_high` and `y_val_high`\n",
    "    -  `_high` is meant to distinguish these variables from the ones used in the previous part\n",
    "    - We will use `X_train_high` to fit Gaussian distribution \n",
    "    - We will use `X_val_high` and `y_val_high` as a cross validation set to select a threshold and determine anomalous vs normal examples"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# load the dataset\n",
    "X_train_high, X_val_high, y_val_high = load_data_multi()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Check the dimensions of your variables\n",
    "\n",
    "Let's check the dimensions of these new variables to become familiar with the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print ('The shape of X_train_high is:', X_train_high.shape)\n",
    "print ('The shape of X_val_high is:', X_val_high.shape)\n",
    "print ('The shape of y_val_high is: ', y_val_high.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Anomaly detection \n",
    "\n",
    "Now, let's run the anomaly detection algorithm on this new dataset.\n",
    "\n",
    "The code below will use your code to \n",
    "* Estimate the Gaussian parameters ($\\mu_i$ and $\\sigma_i^2$)\n",
    "* Evaluate the probabilities for both the training data `X_train_high` from which you estimated the Gaussian parameters, as well as for the the cross-validation set `X_val_high`. \n",
    "* Finally, it will use `select_threshold` to find the best threshold $\\varepsilon$. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Apply the same steps to the larger dataset\n",
    "\n",
    "# Estimate the Gaussian parameters\n",
    "mu_high, var_high = estimate_gaussian(X_train_high)\n",
    "\n",
    "# Evaluate the probabilites for the training set\n",
    "p_high = multivariate_gaussian(X_train_high, mu_high, var_high)\n",
    "\n",
    "# Evaluate the probabilites for the cross validation set\n",
    "p_val_high = multivariate_gaussian(X_val_high, mu_high, var_high)\n",
    "\n",
    "# Find the best threshold\n",
    "epsilon_high, F1_high = select_threshold(y_val_high, p_val_high)\n",
    "\n",
    "print('Best epsilon found using cross-validation: %e'% epsilon_high)\n",
    "print('Best F1 on Cross Validation Set:  %f'% F1_high)\n",
    "print('# Anomalies found: %d'% sum(p_high < epsilon_high))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Expected Output**:\n",
    "<table>\n",
    "  <tr>\n",
    "    <td> <b>Best epsilon found using cross-validation: <b>  </td> \n",
    "    <td> 1.38e-18</td> \n",
    "   </tr>    \n",
    "   <tr>\n",
    "    <td> <b>Best F1 on Cross Validation Set: <b>  </td>\n",
    "     <td> 0.615385 </td> \n",
    "  </tr>\n",
    "    <tr>\n",
    "    <td> <b># anomalies found: <b>  </td>\n",
    "     <td>  117 </td> \n",
    "  </tr>\n",
    "</table>"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
